Adder Neural Network (AdderNet) provides a new way for developing energy-efficient neural networks by replacing the expensive multiplications in convolution with cheaper additions (i.e.l1-norm). To achieve higher hardware efficiency, it is necessary to further study the low-bit quantization of AdderNet. Due to the limitation that the commutative law in multiplication does not hold in l1-norm, the well-established quantization methods on convolutional networks cannot be applied on AdderNets. Thus, the existing AdderNet quantization techniques propose to use only one shared scale to quantize both the weights and activations simultaneously. Admittedly, such an approach can keep the commutative law in the l1-norm quantization process, while the accuracy drop after low-bit quantization cannot be ignored. To this end, we first thoroughly analyze the difference on distributions of weights and activations in AdderNet and then propose a new quantization algorithm by redistributing the weights and the activations. Specifically, the pre-trained full-precision weights in different kernels are clustered into different groups, then the intra-group sharing and inter-group independent scales can be adopted. To further compensate the accuracy drop caused by the distribution difference, we then develop a lossless range clamp scheme for weights and a simple yet effective outliers clamp strategy for activations. Thus, the functionality of full-precision weights and the representation ability of full-precision activations can be fully preserved. The effectiveness of the proposed quantization method for AdderNet is well verified on several benchmarks, e.g., our 4-bit post-training quantized adder ResNet-18 achieves an 66.5% top-1 accuracy on the ImageNet with comparable energy efficiency, which is about 8.5% higher than that of the previous AdderNet quantization methods.
translated by 谷歌翻译
网络架构在基于深度学习的计算机视觉系统中起关键作用。广泛使用的卷积神经网络和变压器将图像视为网格或序列结构,该网格或序列结构并非灵活以捕获不规则和复杂的对象。在本文中,我们建议将图像表示为图形结构,并引入新的视觉GNN(VIG)体系结构,以提取视觉任务的图形级特征。我们首先将图像拆分为许多被视为节点的补丁,然后通过连接最近的邻居来构造图形。根据图像的图表表示,我们构建了VIG模型以在所有节点之间转换和交换信息。 VIG由两个基本模块组成:用于汇总和更新图形信息的图形卷积的图形模块,以及带有两个线性层的FFN模块用于节点特征转换。 VIG的各向同性和金字塔体系结构均具有不同的型号。关于图像识别和对象检测任务的广泛实验证明了我们的VIG架构的优势。我们希望GNN关于一般视觉任务的开创性研究将为未来的研究提供有用的灵感和经验。 pytorch代码可在https://github.com/huawei-noah/effficity-ai-backbones上获得,Mindspore代码可在https://gitee.com/mindspore/models上获得。
translated by 谷歌翻译
由于存储器和计算资源有限,部署在移动设备上的卷积神经网络(CNNS)是困难的。我们的目标是通过利用特征图中的冗余来设计包括CPU和GPU的异构设备的高效神经网络,这很少在神经结构设计中进行了研究。对于类似CPU的设备,我们提出了一种新颖的CPU高效的Ghost(C-Ghost)模块,以生成从廉价操作的更多特征映射。基于一组内在的特征映射,我们使用廉价的成本应用一系列线性变换,以生成许多幽灵特征图,可以完全揭示内在特征的信息。所提出的C-Ghost模块可以作为即插即用组件,以升级现有的卷积神经网络。 C-Ghost瓶颈旨在堆叠C-Ghost模块,然后可以轻松建立轻量级的C-Ghostnet。我们进一步考虑GPU设备的有效网络。在建筑阶段的情况下,不涉及太多的GPU效率(例如,深度明智的卷积),我们建议利用阶段明智的特征冗余来制定GPU高效的幽灵(G-GHOST)阶段结构。舞台中的特征被分成两个部分,其中使用具有较少输出通道的原始块处理第一部分,用于生成内在特征,另一个通过利用阶段明智的冗余来生成廉价的操作。在基准测试上进行的实验证明了所提出的C-Ghost模块和G-Ghost阶段的有效性。 C-Ghostnet和G-Ghostnet分别可以分别实现CPU和GPU的准确性和延迟的最佳权衡。代码可在https://github.com/huawei-noah/cv-backbones获得。
translated by 谷歌翻译
最近的工作表明,二值化的神经网络(BNN)能够大大降低计算成本和内存占用空间,促进在资源受限设备上进行模型部署。然而,与其全精密对应物相比,BNN患有严重的精度降解。旨在降低这种精度差距的研究已经很大程度上主要集中在具有少量或没有1x1卷积层的特定网络架构上,标准二值化方法不起作用。由于1x1卷积在现代架构的设计中是常见的(例如,Googlenet,Reset,DenSenet),开发一种方法以有效地为BNN进行更广泛采用的方法是至关重要的。在这项工作中,我们提出了一个“弹性链路”(EL)模块,通过自适应地将实值的输入特征自适应地添加到后续卷积输出功能来丰富了BNN内的信息流。所提出的EL模块很容易实现,并且可以与BNN的其他方法结合使用。我们证明将EL添加到BNNS对挑战大规模想象数数据集产生显着改进。例如,我们将二值化resnet26的前1个精度从57.9%提高到64.0%。 EL也有助于培训二值化Mobilenet的趋同,为此实现了56.4%的前1个精度。最后,随着RESTNET的整合,它产生了新的最新的最新性,最新的171.9%的前1个精度。
translated by 谷歌翻译
深度卷积神经网络(CNNS)通常是复杂的设计,具有许多可学习的参数,用于准确性原因。为了缓解在移动设备上部署它们的昂贵成本,最近的作品使挖掘预定识别架构中的冗余作出了巨大努力。然而,尚未完全研究现代CNN的输入分辨率的冗余,即输入图像的分辨率是固定的。在本文中,我们观察到,用于准确预测给定图像的最小分辨率使用相同的神经网络是不同的。为此,我们提出了一种新颖的动态分辨率网络(DRNET),其中基于每个输入样本动态地确定输入分辨率。其中,利用所需网络共同地探索具有可忽略的计算成本的分辨率预测器。具体地,预测器学习可以保留的最小分辨率,并且甚至超过每个图像的原始识别准确性。在推断过程中,每个输入图像将被调整为其预测的分辨率,以最小化整体计算负担。然后,我们对几个基准网络和数据集进行了广泛的实验。结果表明,我们的DRNET可以嵌入到任何现成的网络架构中,以获得计算复杂性的相当大降低。例如,DR-RESET-50实现了类似的性能,计算减少约34%,同时增加了1.4%的准确度,与原始Resnet-50上的计算减少相比,在ImageNet上的原始resnet-50增加了10%。
translated by 谷歌翻译
Transformer is a new kind of neural architecture which encodes the input data as powerful features via the attention mechanism. Basically, the visual transformers first divide the input images into several local patches and then calculate both representations and their relationship. Since natural images are of high complexity with abundant detail and color information, the granularity of the patch dividing is not fine enough for excavating features of objects in different scales and locations. In this paper, we point out that the attention inside these local patches are also essential for building visual transformers with high performance and we explore a new architecture, namely, Transformer iN Transformer (TNT). Specifically, we regard the local patches (e.g., 16×16) as "visual sentences" and present to further divide them into smaller patches (e.g., 4×4) as "visual words". The attention of each word will be calculated with other words in the given visual sentence with negligible computational costs. Features of both words and sentences will be aggregated to enhance the representation ability. Experiments on several benchmarks demonstrate the effectiveness of the proposed TNT architecture, e.g., we achieve an 81.5% top-1 accuracy on the ImageNet, which is about 1.7% higher than that of the state-of-the-art visual transformer with similar computational cost.
translated by 谷歌翻译
A recent study has shown a phenomenon called neural collapse in that the within-class means of features and the classifier weight vectors converge to the vertices of a simplex equiangular tight frame at the terminal phase of training for classification. In this paper, we explore the corresponding structures of the last-layer feature centers and classifiers in semantic segmentation. Based on our empirical and theoretical analysis, we point out that semantic segmentation naturally brings contextual correlation and imbalanced distribution among classes, which breaks the equiangular and maximally separated structure of neural collapse for both feature centers and classifiers. However, such a symmetric structure is beneficial to discrimination for the minor classes. To preserve these advantages, we introduce a regularizer on feature centers to encourage the network to learn features closer to the appealing structure in imbalanced semantic segmentation. Experimental results show that our method can bring significant improvements on both 2D and 3D semantic segmentation benchmarks. Moreover, our method ranks 1st and sets a new record (+6.8% mIoU) on the ScanNet200 test leaderboard. Code will be available at https://github.com/dvlab-research/Imbalanced-Learning.
translated by 谷歌翻译
Weakly-supervised object localization aims to indicate the category as well as the scope of an object in an image given only the image-level labels. Most of the existing works are based on Class Activation Mapping (CAM) and endeavor to enlarge the discriminative area inside the activation map to perceive the whole object, yet ignore the co-occurrence confounder of the object and context (e.g., fish and water), which makes the model inspection hard to distinguish object boundaries. Besides, the use of CAM also brings a dilemma problem that the classification and localization always suffer from a performance gap and can not reach their highest accuracy simultaneously. In this paper, we propose a casual knowledge distillation method, dubbed KD-CI-CAM, to address these two under-explored issues in one go. More specifically, we tackle the co-occurrence context confounder problem via causal intervention (CI), which explores the causalities among image features, contexts, and categories to eliminate the biased object-context entanglement in the class activation maps. Based on the de-biased object feature, we additionally propose a multi-teacher causal distillation framework to balance the absorption of classification knowledge and localization knowledge during model training. Extensive experiments on several benchmarks demonstrate the effectiveness of KD-CI-CAM in learning clear object boundaries from confounding contexts and addressing the dilemma problem between classification and localization performance.
translated by 谷歌翻译
Witnessing the impressive achievements of pre-training techniques on large-scale data in the field of computer vision and natural language processing, we wonder whether this idea could be adapted in a grab-and-go spirit, and mitigate the sample inefficiency problem for visuomotor driving. Given the highly dynamic and variant nature of the input, the visuomotor driving task inherently lacks view and translation invariance, and the visual input contains massive irrelevant information for decision making, resulting in predominant pre-training approaches from general vision less suitable for the autonomous driving task. To this end, we propose PPGeo (Policy Pre-training via Geometric modeling), an intuitive and straightforward fully self-supervised framework curated for the policy pretraining in visuomotor driving. We aim at learning policy representations as a powerful abstraction by modeling 3D geometric scenes on large-scale unlabeled and uncalibrated YouTube driving videos. The proposed PPGeo is performed in two stages to support effective self-supervised training. In the first stage, the geometric modeling framework generates pose and depth predictions simultaneously, with two consecutive frames as input. In the second stage, the visual encoder learns driving policy representation by predicting the future ego-motion and optimizing with the photometric error based on current visual observation only. As such, the pre-trained visual encoder is equipped with rich driving policy related representations and thereby competent for multiple visuomotor driving tasks. Extensive experiments covering a wide span of challenging scenarios have demonstrated the superiority of our proposed approach, where improvements range from 2% to even over 100% with very limited data. Code and models will be available at https://github.com/OpenDriveLab/PPGeo.
translated by 谷歌翻译
In this work, we focus on instance-level open vocabulary segmentation, intending to expand a segmenter for instance-wise novel categories without mask annotations. We investigate a simple yet effective framework with the help of image captions, focusing on exploiting thousands of object nouns in captions to discover instances of novel classes. Rather than adopting pretrained caption models or using massive caption datasets with complex pipelines, we propose an end-to-end solution from two aspects: caption grounding and caption generation. In particular, we devise a joint Caption Grounding and Generation (CGG) framework based on a Mask Transformer baseline. The framework has a novel grounding loss that performs explicit and implicit multi-modal feature alignments. We further design a lightweight caption generation head to allow for additional caption supervision. We find that grounding and generation complement each other, significantly enhancing the segmentation performance for novel categories. We conduct extensive experiments on the COCO dataset with two settings: Open Vocabulary Instance Segmentation (OVIS) and Open Set Panoptic Segmentation (OSPS). The results demonstrate the superiority of our CGG framework over previous OVIS methods, achieving a large improvement of 6.8% mAP on novel classes without extra caption data. Our method also achieves over 15% PQ improvements for novel classes on the OSPS benchmark under various settings.
translated by 谷歌翻译